Accelerated Distance Computation with Encoding Tree for High Dimensional Data

Authors

  • Shicong Liu
  • Junru Shao
  • Hongtao Lu
Abstract

We propose a novel method for computing distances between pairs of high-dimensional vectors that uses encodings generated by vector quantization. Vector quantization based methods have been successful in handling large-scale high-dimensional data. These methods compress vectors into short encodings and allow efficient distance computation between an uncompressed vector and a compressed dataset without explicit decompression. For large datasets, however, these distance computation methods perform excessive computations. We avoid the excess by storing the encodings in an Encoding Tree (E-Tree); interestingly, this also lowers memory consumption. We also propose the Encoding Forest (E-Forest) to lower the computation cost further. E-Tree and E-Forest are compatible with various existing quantization-based methods. Our experiments show that the proposed methods drastically speed up distance computation for high-dimensional data, and that various existing algorithms can benefit from them.

Introduction

The rapid development of the Internet in recent years has brought explosive growth of online information. Researchers have been developing methods that use this huge amount of data for machine learning, information retrieval, computer vision, etc. Because the majority of large-scale datasets consist of high-dimensional data, there is an increasing need for efficient basic operations such as evaluating distances and computing scalar products. Product Quantization (PQ) (Jegou, Douze, and Schmid 2011) is a typical method for fast distance and scalar product computation on high-dimensional data. PQ compresses high-dimensional data into short encodings and can evaluate distances or scalar products between uncompressed and compressed vectors without explicit decompression.
Given a d-dimensional dataset, PQ compresses it by first splitting the vector dimensions into M groups, then quantizing each dimension group separately to generate M codebooks, each containing K codewords (each codeword has d/M dimensions). Finally, one codeword is picked from each codebook to encode an input vector. The compressed vector has M parts, each occupying log2 K bits. An encoded vector is approximated (decompressed) by the concatenation of the M assigned codewords. The distances between an uncompressed vector x and N PQ-compressed vectors can be computed efficiently in O(MN) time by using lookup tables. This is introduced as Asymmetric Distance Computing (ADC) in (Jegou, Douze, and Schmid 2011). The idea extends easily to efficient scalar product computation (Du and Wang 2014), etc.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

PQ enables efficient Approximate Nearest Neighbor search, where it achieves favorable memory/speed vs. accuracy trade-offs against several competitive methods, including hashing-based and tree-based schemes (Ge et al. 2013), (Norouzi and Fleet 2013). Researchers have also developed various quantization methods motivated by Product Quantization, e.g. Tree Quantization (Babenko and Lempitsky 2015), Composite Quantization (Ting Zhang 2014), Cartesian K-means (Norouzi and Fleet 2013), Additive Quantization (Babenko and Lempitsky 2014), etc., to further lower the quantization error.

Existing problem: Although ADC is efficient compared to computing the distances directly, it still performs excessive computations. Existing vector quantization methods simply store the encodings sequentially in memory and exhaustively perform ADC to compute the approximate distances. However, in any quantized dataset many encodings share the same prefixes. These prefixes are recomputed repeatedly by ADC, and they also take up excessive memory.
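The lookup-table computation behind ADC described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `build_lookup_tables` and `adc_distances` are hypothetical, the tiny codebooks are hand-written rather than learned, and squared Euclidean distance is assumed.

```python
def build_lookup_tables(query, codebooks):
    """tables[m][k] = squared distance between the m-th chunk of the
    query and codeword k of codebook m (one table per dimension group)."""
    M = len(codebooks)
    sub_dim = len(query) // M  # each codeword has d/M dimensions
    tables = []
    for m, codebook in enumerate(codebooks):
        chunk = query[m * sub_dim:(m + 1) * sub_dim]
        tables.append([sum((a - b) ** 2 for a, b in zip(chunk, cw))
                       for cw in codebook])
    return tables


def adc_distances(tables, encodings):
    """Approximate squared distances with M table lookups per encoding,
    i.e. O(MN) lookups for N encodings, without decompressing anything."""
    return [sum(tables[m][code] for m, code in enumerate(enc))
            for enc in encodings]


# Toy setting (hypothetical values): d = 4, M = 2 groups, K = 2 codewords each.
codebooks = [
    [[0, 0], [1, 1]],  # codebook for dimensions 0-1
    [[0, 1], [2, 2]],  # codebook for dimensions 2-3
]
query = [1, 0, 0, 1]
encodings = [(0, 0), (1, 1), (0, 1)]

tables = build_lookup_tables(query, codebooks)
distances = adc_distances(tables, encodings)
```

Note that every encoding costs M lookups here even when it shares a prefix with its neighbours, which is the redundancy the paragraph above points out.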
Our contribution: In this paper, we propose the Encoding Tree (E-Tree) to lower memory consumption and speed up distance computation for encodings generated by vector quantization methods. An E-Tree is a compact version of a prefix tree in which nodes having only one leaf child are recursively merged. We propose a Hierarchical Memory Structure for the Encoding Tree, designed for efficient depth-first traversal, which allows accelerated distance computation. To perform accelerated distance computation, we maintain a very short set of "partial" ADC results and traverse the tree depth-first. The accelerated distance computation is cache-friendly and easily parallelized, as it accesses memory sequentially. Interestingly, with the Hierarchical Memory Structure we are able to speed up distance computation and lower memory consumption at the same time. For further speed-up, one can build an Encoding Forest by generating multiple E-Trees on different parts of the encodings, at a slight cost in memory consumption.

As methods for fast distance computation, E-Tree and E-Forest are fully compatible with various existing quantization methods: one simply substitutes E-Tree/E-Forest for ADC in the distance computation. E-Forest achieves up to 111.7% speedup compared to naive ADC, and E-Tree lowers memory consumption by 12.5%. E-Tree/E-Forest can significantly accelerate various related algorithms, e.g. Locally Optimized Product Quantization by 74% and IVFADC by 81%. Applications relying on efficient distance computation could benefit greatly from our methods.

arXiv:1509.05186v2 [cs.CV] 18 Sep 2015

Related Work

Vector quantization is commonly applied to high-dimensional data in order to manipulate the data efficiently, for example to compute distances between vectors. It essentially maps a vector to a codeword and uses the codeword to approximate the original vector.
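The idea of reusing partial ADC results along shared prefixes, as outlined in the contribution above, can be illustrated with a plain prefix tree. The sketch below is a simplification under stated assumptions: it uses an ordinary nested-dict trie rather than the paper's merged-node E-Tree or its Hierarchical Memory Structure, the lookup tables are hand-written toy values, and the names `build_prefix_tree` and `tree_distances` are hypothetical.

```python
def build_prefix_tree(encodings):
    """Insert every encoding into a nested-dict trie. Each node keeps the
    ids of the vectors whose full encoding ends at that node."""
    root = {"children": {}, "ids": []}
    for idx, enc in enumerate(encodings):
        node = root
        for code in enc:
            node = node["children"].setdefault(
                code, {"children": {}, "ids": []})
        node["ids"].append(idx)
    return root


def tree_distances(root, tables, n):
    """Depth-first traversal that carries the partial sum of the current
    prefix, so each shared prefix is looked up once, not once per vector."""
    dists = [None] * n
    lookups = 0
    stack = [(root, 0, 0)]  # (node, depth, partial ADC sum)
    while stack:
        node, depth, partial = stack.pop()
        for idx in node["ids"]:
            dists[idx] = partial
        for code, child in node["children"].items():
            lookups += 1  # one table lookup per tree edge
            stack.append((child, depth + 1, partial + tables[depth][code]))
    return dists, lookups


# Toy lookup tables for M = 3 parts and K = 2 codewords (hypothetical values).
tables = [[1, 4], [0, 2], [3, 5]]
encodings = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 1)]

root = build_prefix_tree(encodings)
dists, lookups = tree_distances(root, tables, len(encodings))
naive_lookups = len(encodings) * len(tables)  # plain ADC: M lookups per vector
```

In this toy case the traversal performs one lookup per tree edge (9) instead of M lookups per vector (15), while producing the same sums as plain ADC. The savings on real data depend on how many prefixes the encodings actually share.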
Take Product Quantization as an example. It first decomposes the original data space into the Cartesian product of M disjoint lower-dimensional subspaces and learns M codebooks C_m = {c_m(1), ..., c_m(K)}, m = 1, ..., M, one for each subspace. A vector x is then encoded with C_m on the corresponding dimensions to produce an M-encoding: x → (i_1(x), i_2(x), ..., i_M(x)). After padding the codewords with zero chunks to obtain full-dimensional codewords, x can be reconstructed as x ≈ c_1(i_1(x)) + c_2(i_2(x)) + ... + c_M(i_M(x)). We can perform Asymmetric Distance Computation (ADC), introduced in (Jegou, Douze, and Schmid 2011), to compute the distance between a vector and quantized vectors. The Euclidean distance between a vector q and a database vector x is approximated by:
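The excerpt ends before the expression itself. As a hedged reconstruction rather than a quotation, the standard ADC approximation for Product Quantization follows from substituting the reconstruction of x given above. Here q^{(m)} denotes the part of q lying in the m-th subspace and \hat{c}_m the unpadded (d/M-dimensional) codeword; both symbols are notation introduced for this sketch and do not appear in the source.

```latex
\| q - x \|^2
  \;\approx\; \Big\| q - \sum_{m=1}^{M} c_m\big(i_m(x)\big) \Big\|^2
  \;=\; \sum_{m=1}^{M} \big\| q^{(m)} - \hat{c}_m\big(i_m(x)\big) \big\|^2
```

The second equality holds because the zero-padded codewords occupy disjoint dimension groups, so the squared norm splits into M independent terms. Each term depends only on m and on the code i_m(x), which is why it can be precomputed once per query in an M × K lookup table.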

Similar resources

A method for 2-dimensional inversion of gravity data

Applying 2D algorithms for inverting the potential field data is more useful and efficient than their 3D counterparts, whenever the geologic situation permits. This is because the computation time is less and modeling the subsurface is easier. In this paper we present a 2D inversion algorithm for interpreting gravity data by employing a set of constraints including minimum distance, smoothness,...

A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure

The IP Lookup Process is a key bottleneck in routing due to the increase in routing table size, increasing traffic and migration to IPv6 addresses. The IP address lookup involves computation of the Longest Prefix Matching (LPM), which existing solutions such as BSD Radix Tries, scale poorly when traffic in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...

Fast Computation of the Tree Edit Distance between Unordered Trees Using IP Solvers

We propose a new method for computing the tree edit distance between two unordered trees by problem encoding. Our method transforms an instance of the computation into an instance of some IP problems and solves it by an efficient IP solver. The tree edit distance is defined as the minimum cost of a sequence of edit operations (either substitution, deletion, or insertion) to transform a tree int...

Convex Optimization Learning of Faithful Euclidean Distance Representations in Nonlinear Dimensionality Reduction

Classical multidimensional scaling only works well when the noisy distances observed in a high dimensional space can be faithfully represented by Euclidean distances in a low dimensional space. Advanced models such as Maximum Variance Unfolding (MVU) and Minimum Volume Embedding (MVE) use Semi-Definite Programming (SDP) to reconstruct such faithful representations. While those SDP models are ca...

Journal title:
  • CoRR

Volume abs/1509.05186  Issue

Pages  -

Publication date 2015